Method and system for morphologizing text

ABSTRACT

A method and system for morphologizing written or printed texts, including Japanese texts, in accordance with connection codes are disclosed. The longest morphemes are divided one at a time from the characters in a sentence. Each division is achieved by forming, from the remaining characters in the sentence, the longest morpheme which is listed in a dictionary of valid morphemes and determining if it is conjunctive with the previously divided morpheme. To determine if a formed morpheme is conjunctive, its associated pairs of front and back connection codes are retrieved. If a front connection code of one retrieved pair and a back connection code of a pair of connection codes of the previously divided morpheme are co-listed in a table of permissible relationships, the formed morpheme is conjunctive. If no morpheme may be divided from the remaining characters in the sentence, a previously divided morpheme is redivided. If a morpheme can be divided and is conjunctive with the previous morpheme, a connection action, describing the relationship between the formed morpheme and the previously divided morpheme, is recorded. In response to certain connection actions, the next morpheme is divided by forming it from a single character of the remaining characters and testing it. After all of the morphemes are divided, a word graph is constructed from the morphemes in accordance with the connection actions relating adjacent morphemes.

FIELD OF THE INVENTION

The present invention relates to character recognition and more particularly to a method and system for segmenting written or printed texts into morphemes.

BACKGROUND OF THE INVENTION

FIG. 1 depicts an optical character recognition (OCR) system 10 having an optical scanner 12 connected to a data processing system 14. The optical scanner 12 scans a written or printed page of text and reads the individual characters printed or written thereon. Typically, the scanner 12 is capable of recognizing any character from a predefined set and returning a symbolic representation associated with each scanned character. These codes are fed to the data processing system 14 for further processing.

The data processing system 14 shown in FIG. 1 comprises a CPU 16 interconnected to a main memory 18 via a bus 20. Also connected to the bus 20 are a disk memory 22 and an I/O interface 24. The I/O interface 24 is connected to the optical scanner 12. In this manner, the data processing system 14 may receive the characters read in by the optical scanner 12 via the I/O interface 24.

Frequently, it is desired to process printed text. The text itself may be scanned by the OCR system to extract the individual characters. From there, the symbolic representations of each character are transmitted to the data processing system for further processing in accordance with a desired application such as translating the written text to another language or interpreting the written text.

Occasionally it is desirable to morphologize the text, that is, to segment the words of the text into morphemes and then relate the morphemes to one another. A morpheme is an indivisible word fragment which can convey meaning. For instance, the word "gun" is a morpheme. It conveys a certain meaning to the reader and cannot be divided into a smaller unit and still convey the same meaning. The word "guns," on the other hand, has two morphemes "gun" and "s." The first morpheme conveys the same meaning as before. But, by placing the second morpheme "s" adjacent to the first morpheme, the sense of plurality is conveyed. Many morphemes may be placed into a single word to convey a more complicated meaning. For instance, the word "gunfighter" contains three morphemes "gun," "fight" and "er."

As can be seen above, some morphemes can stand alone as words such as "gun" and "fight." Others are word fragments and cannot stand alone such as "s" and "ing." Still others appear to be composed of further divisible units such as "together" or the phrase "in order to". However, these phrases and words are also morphemes as they cannot be further divided without drastically changing their meaning.

In the English language it is often a simple task to determine where a particular word begins and ends simply by using the "space" character as a word delimiter. This is not so simple in other languages such as Japanese, Chinese or Korean. Each character within a sentence is chosen from a large set and is approximately evenly spaced in relation to the other characters. Further, the placement of one character in a sentence may drastically alter the way characters are parsed by the reader into words. Nonetheless, these texts exhibit morpheme properties. Characters alone, or strings of characters, form indivisible units or morphemes which convey a particular meaning. Again, the morpheme may be a whole phrase, word, word fragment or semantic unit which merely conveys information regarding another morpheme.

The goal of morphology is to segment a text into morphemes and then relate the morphemes to one another. Several morphology methods have been previously disclosed (see, e.g., Japanese Pat. No. 61-210479, Japanese Pat. No. 60-20234).

Certain languages, such as Japanese, have a structured grammar in which two adjacent morphemes must obey certain rules of connection. These rules dictate whether two morphemes may be placed adjacent to one another. Two prior morphology methods have exploited the connection relationship between morphemes in the Japanese language. A first approach, called Longest Match, segments the sentence of characters or divides the sentence into morphemes, one at a time. Initially, the parser attempts to string together the longest series of characters, starting from the beginning of the sentence, which is listed in a Japanese dictionary as a recognizable morpheme. It is then determined whether this morpheme may be connected to the beginning of the sentence. If this morpheme cannot be the first morpheme in the sentence (i.e., the above-generated morpheme is not conjunctive with the beginning of the sentence), the parser returns to the step in which the longest morpheme was formed. Then, the second longest morpheme is formed and tested in the above manner.

After the first morpheme is divided, a second morpheme is divided in a similar manner. The longest possible morpheme is formed from the remaining characters of the sentence starting with the character following where the first morpheme left off. The second morpheme so formed is tested to determine if it is conjunctive with the first morpheme. If the two morphemes are not conjunctive, the second morpheme is reformulated, as described above, i.e., the second morpheme is formed from the next to longest string of characters remaining in the sentence and tested. It may be appreciated that reformulation may occur for any morpheme in the parsing analysis. Further, although several morphemes may appear to be successfully divided initially, it may later turn out that one was incorrectly divided. The parser therefore has the ability to backtrack to any step in the parsing process including backtracking to redivide any previously segmented morpheme. FIG. 2 illustrates such a case.

Depicted therein is an exemplary goal tree illustrating possible states of the above-described Longest Match process. Each node of the tree depicts a possible state of the process after a morpheme is divided from a sentence of characters represented by the characters A through M. The root node 400 of the goal tree represents the state of a sentence with no divided morphemes. The root node 400 has three children nodes 426, 406, 401 showing three possible divisions of the first morpheme from the sentence in accordance with the above stated criteria (caret marks, i.e., "Λ", delimit divided morphemes in FIG. 2). Each child node 426, 406, 401 depicts the state of the sentence after the formation of one of three morphemes from the sentence which is both listed in the dictionary and conjunctive with the beginning of a sentence. Thus, "A," "ABC" and "ABCD" are all morphemes listed in the dictionary which are conjunctive with the beginning of a sentence.

As depicted in FIG. 2, of the morphemes that may be formed with the characters A-M which are conjunctive with the beginning of a sentence, "ABCD" is the longest. In the Longest Match process, this would be the first attempted division of the first morpheme. Thus, node 401 depicts the first state of the Longest Match process.

The Longest Match process then proceeds to divide a second morpheme from the characters E-M. As depicted by nodes 402 and 403, only two morphemes may be formed from the characters E-M which are both listed in the dictionary and conjunctive with the first morpheme "ABCD." Again, however, the Longest Match process forms and tests morphemes in the order of decreasing length. Thus, the longer morpheme, "EFGH," would be formed from the remaining characters E-M and successfully tested. This would result in the state as depicted by node 402, with the morphemes "ABCD" and "EFGH" divided from the sentence.

As depicted in FIG. 2, node 402 has no children. This means that no morpheme listed in the dictionary can be formed with the remaining characters I-M or that if morphemes can be formed, they are not conjunctive with the morpheme "EFGH." Hence, the Longest Match parser "backtracks" to the state of node 401. With the state of the Longest Match parser as depicted in node 401, the parser attempts to form a different second morpheme which is conjunctive with "ABCD" from the characters E-M. As depicted in node 403, the next longest morpheme which is conjunctive with "ABCD" is "EF." The Longest Match process thereafter attempts to divide the third and fourth morphemes ("GH," "IJK," respectively) traversing the states of nodes 403 to 405 in a similar manner.

When node 405 is reached, the parser determines that no fifth morpheme may be formed from the remaining characters L and M which is conjunctive with the fourth morpheme "IJK." The parser then backtracks to node 404 to redivide the fourth morpheme from the characters I-M. Since no other morphemes may be formed therefrom which are conjunctive with the morpheme "GH", the parser backtracks to node 403 to redivide the third morpheme from the characters G-M. Since that is not possible, the parser backtracks to node 401 to redivide the second morpheme from the characters E-M. Since all of the possible morphemes which are conjunctive with "ABCD" have been tried, the parser backtracks to the root node 400 to redivide the first morpheme from the characters A-M. At this point, the first morpheme is redivided as "ABC" as depicted at node 406. It may be appreciated that processing continues in the above-described manner through the numbered nodes in numerical order until node 425 is reached. At this point, all of the morphemes are divided from the sentence.

As can be seen from the goal tree of FIG. 2, the Longest Match process uses very limited criteria to form morphemes. The morphemes are formed in decreasing order of length and then tested to determine if they are conjunctive. Occasionally, however, a morpheme originally thought to be correct later turns out to be incorrect. At this point, the Longest Match process backtracks to redivide the morphemes of the sentence in the reverse of the order in which they were divided. As such, the efficiency of the algorithm suffers because many poor choices are made in the segmentation process and are not discovered until much later.

The other approach, called the Parse List Method, attempts to more selectively search for the morphemes of the sentence. At each state of the solution, the parser, in this method, determines all of the possible choices for dividing the next morpheme from the remaining characters in the sentence. For instance, using the example of FIG. 2, the parser would first determine that three morphemes "A," "ABC" and "ABCD" could be formed from the characters A-M and are conjunctive with the beginning of the sentence.

Each partial solution is assigned a weight in accordance with some formula for determining the likelihood of segmenting the entire sentence using this particular partial solution. The partial solution that seems best suited to succeed is selected and processing continues on this partial solution. All potential second morphemes are formed from the remaining characters following the first morpheme which are listed in the dictionary and conjunctive with the first morpheme. For instance, assume "ABCD" appeared most likely to lead to a fully segmented sentence. Both "EF" and "EFGH" would be formed and tested. Each new partial solution thereby formed is also assigned a weight. Again, all of the weights of all of the partial solutions are compared and the partial solution which seems best suited to succeed is continued. It may appear, for instance, at this stage, that the first divided morpheme, "ABCD," which previously looked promising will not inevitably lead to a final solution (i.e., a fully segmented sentence). In such a case, the most promising partial solution will be one of the other states having a different first divided morpheme, i.e., the states of nodes 426, 406 of FIG. 2. Thus, the most promising partial solution is continued, i.e., all of the choices for the next parsed morpheme of this partial solution will be explored and assigned weights. For example, suppose after comparing the weights of nodes 402, 403, 406 and 426, node 406 appears to have the most promising solution. In such a case, nodes 407, 416 are examined by dividing "D" and "DEF" from the sentence as the second morpheme. This process continues until the final solution is achieved.
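By way of illustration only, the Parse List Method described above may be sketched as a best-first search over partial solutions. The sketch below is not taken from the cited prior art; the function name parse_list, the predicate conjunctive and the weighting function weight are hypothetical, and the weight used in the example (the number of characters already covered) is merely an assumed stand-in for "some formula".

```python
import heapq

def parse_list(sentence, dictionary, conjunctive, weight):
    """Best-first segmentation: always extend the heaviest partial solution.

    `conjunctive(prev, cand)` and `weight(morphemes)` are hypothetical
    callbacks; `prev` is None at the beginning of the sentence.
    """
    # Heap entries: (negated weight, tie-breaker, position, morphemes so far).
    heap = [(0, 0, 0, ())]
    tie = 1
    while heap:
        _, _, pos, morphs = heapq.heappop(heap)
        if pos == len(sentence):
            return list(morphs)          # fully segmented sentence
        prev = morphs[-1] if morphs else None
        # Every possible next morpheme is formed, tested and weighted.
        for end in range(pos + 1, len(sentence) + 1):
            cand = sentence[pos:end]
            if cand in dictionary and conjunctive(prev, cand):
                new = morphs + (cand,)
                heapq.heappush(heap, (-weight(new), tie, end, new))
                tie += 1
    return None                          # no segmentation exists

# Toy data (assumed, not the dictionary of FIG. 2).
toy_dictionary = {"A", "ABC", "ABCD", "DEF"}
covered = lambda morphs: sum(len(m) for m in morphs)
result = parse_list("ABCDEF", toy_dictionary, lambda p, c: True, covered)
```

In this toy run the partial solution beginning with "ABCD" is expanded first, yields no continuation, and the search then resumes from the partial solution beginning with "ABC". Every candidate is weighted before processing continues, which is the cost the following paragraph criticizes.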

The Parse List Method, although theoretically an optimal process, proves to be inadequate in practice. This is because the computation of weights cannot be 100% accurate. Further, at each stage of the morphological process, every potential morpheme of the most promising partial solution must be evaluated and assigned a weight before processing continues. This reduces efficiency by requiring the evaluation of many choices for the next morpheme in a sentence. Finally, the formula which determines the weights for each partial solution may have a significant time requirement to calculate the weight for each partial solution. All of these considerations reduce the efficiency of the algorithm.

It is therefore an object of the present invention to provide a method and system for morphologizing texts which is efficient and reduces the amount of backtracking.

SUMMARY OF THE INVENTION

The present invention is directed to an efficient method and system for morphologizing texts such as Japanese texts. The present invention exploits the connection relationships imposed by the rules of grammar of the language in which the text is written. Illustratively, the connection relationships are implemented using connection codes and connection action codes. For instance, every morpheme in the Japanese dictionary may be assigned a pair or several pairs of connection codes such as depicted in FIG. 3. As depicted in FIG. 3, each pair of codes has a front connection code and a back connection code. For example, the morpheme 1 has the front connection code I and the back connection code 54. The morpheme 2 has the front connection code 199 and the back connection code 138. The front connection code serves to relate the particular morpheme to a prior morpheme in the sentence. Conversely, the back connection code serves to relate the particular morpheme to a subsequent morpheme in a sentence.

All of the permissible morpheme relationships in the Japanese language may be tabulated as depicted in FIG. 4. The table segment shown in FIG. 4 is a partial listing of such a table. Each table entry is cross-referenced by a pair of codes comprising one connection code from each of two adjacent morphemes. The first code is the back connection code of the previous morpheme and the second code is the front connection code of the subsequent morpheme. In addition to storing all of the permissible relationships of morphemes, the table also stores an entry describing the relationship between the two adjacent morphemes called a connection action. As depicted in FIG. 4, each pair of connection codes indexes a connection action code.

The table of FIG. 4 tabulates the rules of grammar which dictate when two morphemes may be placed adjacent to one another. Two morphemes are conjunctive, i.e., may be placed adjacent to one another, only if their respective connection codes are co-indexed in the table of permissible relationships. The contra-positive is also true; if the respective connection codes of two morphemes are not co-indexed in the table of permissible relationships, then the morphemes cannot be placed adjacent to one another. Hence, when two adjacent morphemes are identified, by consulting a table of permissible relationships, one may determine whether the morphemes are placed in accordance with the rules of grammar. Further, one may determine the relationship between the two morphemes by retrieving the connection action entry in the table.
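Illustratively, and purely as a sketch, such a table of permissible relationships may be modeled as a mapping from a (back connection code, front connection code) pair to a connection action code. The names PERMISSIBLE, connection_action and is_conjunctive are hypothetical, and the entries shown are assumed values chosen for the example; they are not read from FIG. 4.

```python
# Hypothetical table: (back code of previous morpheme,
#                      front code of subsequent morpheme) -> action code.
# The entries below are assumed for illustration only.
PERMISSIBLE = {
    (54, 199): 7,
    (138, 12): 3,
}

def connection_action(back_code, front_code):
    """Return the connection action code, or None if the pair is not listed."""
    return PERMISSIBLE.get((back_code, front_code))

def is_conjunctive(back_code, front_code):
    """Two morphemes may be adjacent only if their codes are co-indexed."""
    return connection_action(back_code, front_code) is not None
```

A single lookup thus answers both questions posed above: whether the two morphemes may be adjacent, and, if so, which connection action relates them.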

In the operation of the invention, a sentence of characters is segmented or divided into morphemes, one morpheme at a time. Initially, the longest possible sub-string of characters, which is listed in a dictionary, is divided from the remaining characters in the sentence. Preferably, the longest morpheme is obtained using a "pattern table" which is a graph or tree structure of interconnected character nodes. The pattern table is described in detail below.

The longest obtained morpheme is not necessarily grammatically conjunctive with the morphemes previously divided from the sentence (i.e., it may not be permissible to place it adjacent to them) and must first be tested. To that end, all pairs of front and back connection codes of this morpheme are illustratively retrieved from a first table. A determination is then made if this morpheme is grammatically conjunctive with the previous morpheme (or the beginning of the sentence if there are no previously divided morphemes in the sentence). This is achieved by consulting a table of permissible relationships or connection action code table using every permutation of one front connection code from each pair of connection codes associated with the untested morpheme and one back connection code selected from each pair associated with the previous morpheme. A default back connection code is supplied if there are no previously divided morphemes in the sentence.

The connection action code entry, indexed by each above-described permutation of connection codes selected from the untested morpheme and the previously divided morpheme, is retrieved from a second table, if present. If a particular front connection code, selected from a pair of codes of the untested morpheme, does not co-index a connection action code in any above-described permutation, the entire pair of codes, from which this particular code was selected, is eliminated. For example, suppose a tested morpheme has three pairs of front and back connection codes (a,b), (c,d) and (e,f) of which "a", "c" and "e" are front connection codes. If "c" does not co-index a connection action code with any back connection code of the previous morpheme, the pair (c,d) is eliminated.
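The elimination of code pairs described above may be sketched as follows. This is a minimal illustration under assumed data: the function name prune_pairs, the layout of the table (a mapping keyed by back and front code) and the codes of the previous morpheme ("p", "q") are all hypothetical.

```python
def prune_pairs(prev_pairs, cand_pairs, table):
    """Keep only the candidate (front, back) pairs whose front code
    co-indexes a connection action code with some back code of the
    previous morpheme; also collect the action codes found."""
    prev_backs = {back for _front, back in prev_pairs}
    kept, actions = [], set()
    for front, back in cand_pairs:
        found = {table[(b, front)] for b in prev_backs if (b, front) in table}
        if found:                      # at least one permutation is listed
            kept.append((front, back))
            actions |= found
        # otherwise the whole pair is eliminated
    return kept, actions

# The (a,b), (c,d), (e,f) example from the text, with assumed table entries.
previous = [("x", "p"), ("y", "q")]
candidate = [("a", "b"), ("c", "d"), ("e", "f")]
table = {("p", "a"): 1, ("q", "e"): 2}
kept, actions = prune_pairs(previous, candidate, table)
```

Here the pair (c,d) is eliminated because "c" is not listed with either "p" or "q". An empty result for kept would indicate that the tested morpheme is not conjunctive and must be reformulated.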

If at least one connection action code is present, the morphemes may be conjunctive with one another and form a relationship. If no connection action codes are present for any permutation, the morphemes may not be placed adjacent to one another. This indicates that the above-formed longest morpheme is not correct, and that reformulation of the morpheme is necessary. In such a case, the next to the longest morpheme assembled from the sentence of characters should be formed and tested.

If at any time there are leftover characters at the end of the sentence that do not form a morpheme listed in the dictionary or do not form a conjunctive morpheme (i.e., a morpheme which may be placed adjacent to the previously divided morpheme), backtracking to the previously divided morpheme occurs. In other words, a previously divided morpheme, which was originally thought to be correct, will be redivided. Backtracking may occur at any failed stage of processing in order to "undo," i.e., redivide, several previously divided morphemes in the sentence. Once the re-division step is completed, the morphologizing steps continue as before.

If a connection action code is successfully retrieved, meaning that the morpheme may be conjunctive with the previous morphemes, it is recorded and the process repeats itself. In other words, the next morpheme is divided from the remaining characters of the sentence. If particular connection action codes are retrieved, then the division of the next morpheme is achieved by forming the next morpheme from the single next remaining character. This procedure, referred to as look ahead processing, illustratively exploits a particular property of Japanese grammar, namely that a particular single character must follow as the next morpheme in certain contexts.

After all of the morphemes are divided from the sentence, connection actions, associated with each connection action code, are executed. In the execution step, morphemes are placed into a word graph which collocates the morphemes in accordance with the roles they play in the sentence. Knowledge information, associated with each morpheme, is illustratively consulted to assist the construction of the graph. Thereafter, the graphed sentence may be further processed in accordance with a desired application such as translation, understanding, etc.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 depicts an optical character recognition (OCR) system.

FIG. 2 schematically shows a goal tree depicting the states of a longest match parser.

FIG. 3 depicts the relationship of adjacent morphemes, their connection codes, and connection action codes.

FIG. 4 depicts a segment of a table of permissible morpheme relationships.

FIG. 5 schematically illustrates a morphology process according to the present invention.

FIG. 6 depicts a pattern table segment in the form of a tree.

FIG. 7 depicts a format for storing morpheme connection codes and knowledge.

FIG. 8 depicts a connection action code table according to the present invention.

FIG. 9 generally illustrates the connection action codes and respective procedures executed in response thereto by step 118 of FIG. 5.

FIG. 10 depicts the state of selected steps of a sample execution of the morphology process according to the present invention.

FIG. 11 depicts the segmented sentence after the sample execution shown in FIG. 10.

FIG. 12 depicts the state of the segmented sentence shown in FIG. 11 after executing the connection actions.

DESCRIPTION OF THE INVENTION

FIG. 5 depicts a flow chart of a program executed by the CPU 16 (FIG. 1). The program may be written in the C language or LISP. Although the execution of the program is described in connection with morphologizing a Japanese text, the method of the present invention is also applicable to other texts such as Korean and Chinese texts.

Initially, execution begins with step 100 where the CPU 16 (FIG. 1) stores a sentence in a variable referred to as rest string. The rest string variable stores the characters of a sentence in the order in which they appear on the scanned document. As morphemes are divided from the sentence, the characters comprised in each divided morpheme are removed from the rest string. Hence, the rest string variable stores the undivided characters remaining in the inputted sentence at any point in the execution of the morphology process. Additionally, in this step 100, the variable i is initialized to one. This variable indicates which morpheme in the sentence is currently being divided.

Execution in the CPU 16 (FIG. 1) then continues with step 102 where the longest series of characters from the rest string is assembled, starting from the first character in the rest string, to form the longest possible morpheme. In this process, a Japanese dictionary 101 is consulted to ensure that only a morpheme which exists in the Japanese language is assembled from the remaining characters. The Japanese dictionary 101 is stored in memory 18 or 22 of FIG. 1. Preferably, to perform this step efficiently, the longest possible morpheme is assembled with the aid of a pattern table as depicted in FIG. 6.

As depicted in FIG. 6, the pattern table 200 comprises several tree data structures and may be stored in memory 18 or 22 (FIG. 1). One tree, for example, the tree 200-1, is provided to store all of the Japanese morphemes that begin with each character of the Japanese character set. Each root node 200-2, therefore, is associated with a unique character of the Japanese character set. The remaining nodes 200-3 to 200-8 of each tree correspond either to other characters or to a morpheme delimiter. Only characters are stored in intermediary nodes 200-3 to 200-5, 200-8 and only morpheme delimiters are stored at terminal nodes 200-6 to 200-7.

Each tree 200-1 is constructed such that a morpheme may be retrieved by traversing any path (e.g., 200-2, 200-3, 200-4, 200-6) within the tree from the root node 200-2 to a terminal node 200-6, 200-7 and recording the character associated with each node visited in the order of the traversal. The CPU 16 (FIG. 1) may compute the longest morpheme that can be formed from the remaining characters in the sentence by traversing the nodes 200-2 to 200-7 of a tree 200-1 corresponding to each character remaining in the sentence as they appear. For instance, the CPU 16 (FIG. 1) first executes a step which selects a tree 200-1 from the pattern table 200 whose root node 200-2 corresponds to the first character of the rest string (i.e., the next undivided character in the sentence). Thereafter, using the CPU 16 (FIG. 1), the tree 200-1 is traversed from the root node 200-2 to the child node 200-3 or 200-5 corresponding to the next character in the rest string (i.e., the second unparsed character). Then, the CPU 16 (FIG. 1) traverses the tree 200-1 from the present child node, e.g., 200-3, to its child, e.g., 200-4, corresponding to the third character in the rest string, and so on. Upon reaching a node 200-2 to 200-7 which has no children nodes corresponding to the next character in the rest string, the CPU 16 (FIG. 1) determines if the current node, e.g., 200-4, has a terminal node, e.g., 200-6, which stores a delimiter. If not, the CPU 16 (FIG. 1) retraces the traversal of the tree in reverse from the current node to the nearest node with a delimiter child (i.e., a terminal) node. The longest morpheme which may be formed from the remaining characters in the sentence comprises the characters associated with each node traversed from the root node to the terminal node in the order of traversal.
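The traversal just described may be sketched, under stated assumptions, with nested dictionaries standing in for the trees of the pattern table. The names END, build_pattern_table and longest_morpheme are hypothetical, English strings are used in place of Japanese morphemes, and the delimiter here stores a simple integer rather than a memory address.

```python
END = object()  # hypothetical key standing in for a morpheme delimiter node

def build_pattern_table(morphemes):
    """Build one tree per first character; each path ends in a delimiter
    that stores an index standing in for the address held at the terminal."""
    table = {}
    for index, morpheme in enumerate(morphemes):
        node = table
        for ch in morpheme:
            node = node.setdefault(ch, {})
        node[END] = index
    return table

def longest_morpheme(table, rest):
    """Return (morpheme, delimiter value) for the longest listed prefix of
    `rest`, or None if no listed morpheme starts the rest string."""
    node, best = table, None
    for length, ch in enumerate(rest, start=1):
        node = node.get(ch)
        if node is None:               # no child for the next character
            break
        if END in node:                # a delimiter child exists here
            best = (rest[:length], node[END])
    return best

pattern_table = build_pattern_table(["gun", "guns", "gunfight"])
```

Rather than retracing the path in reverse, this sketch remembers the most recent node that had a delimiter child while descending; the morpheme returned is the same. For example, longest_morpheme(pattern_table, "gunfire") descends as far as "gunfi" and falls back to "gun".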

As depicted in FIG. 6, the delimiter 200-6 illustratively is an address which points to the location of an entry 305, within a connection code table (CCT) 105 (see FIG. 5) stored in memory 18 or 22 (FIG. 1). This entry 305, shown in greater detail in FIG. 7, contains the connection code pair or pairs 105-1 (FRONT CODE 1, BACK CODE 1), 105-2 (FRONT CODE 2, BACK CODE 2), . . . , 105-N (FRONT CODE N, BACK CODE N) associated with the longest morpheme (constructed by traversing the tree 200-1 of FIG. 6). Some morphemes can play more than one role and hence typically have more than one pair of connection codes. For instance, the morpheme "work" can be a noun or a verb. To accommodate more than one pair of codes, the morpheme delimiter 200-6 of the pattern table preferably points to the location of the first connection code pair 105-1.

Some morphemes have synonyms. Illustratively, all synonymous groups of morphemes have pointers to only one set of pairs within the CCT 105 (FIG. 5).

As shown in FIG. 7, the other pairs 105-2, . . . , 105-N, are stored in the CCT 105 (FIG. 5) adjacent to the first pair 105-1. Thus, after the CPU 16 (FIG. 1) retrieves the longest formed morpheme from the pattern table, the CPU 16 (FIG. 1) may easily access the CCT 105 (FIG. 5) using the address stored at the terminal node, e.g., 200-6 (FIG. 6). Following the pairs of codes 105-1, 105-2, . . . , 105-N is a CCT delimiter 201. As the CPU 16 (FIG. 1) retrieves the pairs 105-1, 105-2, . . . , 105-N, it scans to see if the CCT delimiter 201 is reached. Upon reaching the CCT delimiter 201, the CPU 16 (FIG. 1) stops retrieving codes.
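A minimal sketch of this retrieval follows, assuming the CCT is laid out as a flat sequence of integers in which each entry is a run of front and back codes closed by a sentinel. The sentinel value CCT_DELIMITER, the function name read_pairs and the code values are assumptions made for the example; FIG. 7 does not fix a particular encoding.

```python
CCT_DELIMITER = -1  # hypothetical sentinel marking the end of an entry

def read_pairs(cct, address):
    """Starting at `address` (as stored at a pattern-table terminal node),
    collect (front, back) pairs until the CCT delimiter is reached.
    Returns the pairs and the position of the delimiter."""
    pairs = []
    i = address
    while cct[i] != CCT_DELIMITER:
        pairs.append((cct[i], cct[i + 1]))
        i += 2
    return pairs, i

# Two entries stored back to back (assumed values).
cct = [1, 54, 199, 138, CCT_DELIMITER, 5, 6, CCT_DELIMITER]
```

With this layout, read_pairs(cct, 0) yields the two pairs of the first entry and stops at position 4, where the delimiter is stored. Storing the pairs contiguously means a single address at the terminal node suffices, however many roles the morpheme can play.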

As shown in FIG. 6, the CCT delimiter 201 (FIG. 7) illustratively points to the location of an entry in a Japanese to Chinese knowledge dictionary 202. Illustratively, the knowledge dictionary is also stored in memory 18 or 22 (FIG. 1). The CPU 16 (FIG. 1) may use the CCT delimiter 201 (FIG. 7) to access the corresponding entry in the knowledge dictionary 202. This is useful if the intended application of the morphology is to translate written Japanese to Chinese. In such a case, the dictionary entry preferably stores important information for collocating the particular morpheme within a word graph after the sentence is segmented.

Referring again to FIG. 5, after the CPU 16 (FIG. 1) computes the longest morpheme in step 102, execution proceeds to step 103 where it is determined if any morpheme at all was retrieved. If no morpheme was retrieved, then execution jumps to step 108 where backtrack processing is performed. In step 108, the index i is decremented so that the previously segmented morpheme will be reformed. Control then returns to step 102 where the next longest morpheme is formed and tested to replace this previously segmented morpheme.

If in step 103 the CPU 16 (FIG. 1) determines that a morpheme was formed, execution continues with step 104 where the CPU 16 (FIG. 1) retrieves the connection codes of the longest morpheme from the CCT 105. As mentioned before, the morpheme may have an associated pointer stored therewith in the dictionary 101 or pattern table 200 (FIG. 6) which points to the appropriate pair or pairs of connection codes in the CCT 105. Alternatively, the morpheme is illustratively used as an index in the CCT 105 and the appropriate pair or pairs of connection codes are retrieved.

Each morpheme has at least one pair of connection codes, a front and a back code. Typically, however, each morpheme has more than one pair of codes. In such a case, the CPU 16 (FIG. 1) retrieves all of the codes and tests each one in steps 106 and 109 (as discussed below). It is quite possible that several different codes are valid (i.e., enable the morpheme to be conjunctive). All of the valid codes are retained while the rest are eliminated.

After at least one connection code pair is retrieved, execution in the CPU 16 (FIG. 1) continues with step 106 in which the longest morpheme is tested to determine if it is conjunctive with the previously divided morpheme. Since no morpheme is divided prior to the first morpheme, the CPU 16 (FIG. 1) determines in that case whether the longest morpheme is conjunctive with the beginning of the sentence. Illustratively, the CPU uses a pair of default connection codes for the beginning of the sentence.

In the comparison step 106, the CPU 16 (FIG. 1) tests the longest formed morpheme by accessing a connection action code table (CACT) 107 stored in the memory 18 or 22 (FIG. 1) using the connection codes of the longest and previous morphemes. The CPU 16 (FIG. 1) performs one access for each permutation of one front connection code selected from the pairs of connection codes of the longest untested morpheme and one back connection code selected from the pairs of the previous morpheme. Illustratively, a default back connection code is provided if there is no previously divided morpheme.

An exemplary segment 203 of a CACT 107 is depicted in FIG. 8. Illustratively, the CACT segment 203 is divided into two tables 204 and 205 which are stored in memory 18 or 22 (FIG. 1), although a single table could also be used. The first table 204 has three hundred entries, one for each of three hundred possible back connection codes of morphemes. The first table also stores with each entry the corresponding number of front connection codes of morphemes which are conjunctive with the back connection code of the entry. For instance, a morpheme having a back connection code of `one` is conjunctive with morphemes having one of eleven particular front connection codes. This is illustrated by table entry 206-1. Similarly, table entry 206-2 indicates that a morpheme with a back connection code of `two` is conjunctive with a morpheme having one of twelve particular front connection codes. The second table 205 stores all of the front connection codes which may connect with each back connection code in the following manner. Locations zero through ten store the eleven post morpheme front connection codes which may be connected with the previous morpheme back connection code of `one`. Thereafter, locations eleven to twenty-two store the twelve post morpheme front connection codes which may be connected with the previous morpheme back connection code of `two` and so on. Thus, in order to compare a pair of front and back connection codes, one must know the number of front connection codes that the back connection code may connect with and the offset of the corresponding entries in table 205.
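The two-table arrangement may be sketched as a table of counts together with a flat table of front connection codes, the offset for each back connection code being the running sum of the preceding counts. The names build_offsets and connects and the front code values are hypothetical; only the counts of eleven and twelve and the resulting locations (zero through ten, eleven through twenty-two) follow the example in the text.

```python
from itertools import accumulate

def build_offsets(counts):
    """Offset of each back code's run in the flat table of front codes."""
    return [0] + list(accumulate(counts))[:-1]

def connects(back_code, front_code, counts, offsets, fronts):
    """True if `front_code` is listed in the run belonging to `back_code`.
    Back codes are assumed to be numbered from one."""
    k = back_code - 1
    start = offsets[k]
    return front_code in fronts[start:start + counts[k]]

# Back code 1 connects with eleven front codes, back code 2 with twelve.
counts = [11, 12]
offsets = build_offsets(counts)              # locations 0 and 11
# Assumed front code values, for illustration only.
fronts = list(range(100, 111)) + list(range(200, 212))
```

Compared with a full three hundred by three hundred array, this layout stores only the permissible combinations, at the cost of a short scan within each run.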

As shown by FIG. 8, the back connection code of the previous morpheme and the front connection code of the above-formed longest morpheme index a third entry which illustratively is a number from 1 to 59. Each number represents the code of a particular connection action. The connection action is a description of the relationship between two morphemes. If the two connection codes index a connection action code in the CACT 107 (FIG. 5), then the particular indexed connection action describes the relationship between the two morphemes. This means that the two morphemes are conjunctive. If no connection action exists for a pair of connection codes, then the two morphemes are not conjunctive by means of this pair of codes. In this manner, as the CPU 16 (FIG. 1) tests the morphemes, the number of connection code pairs associated with the untested morpheme is pruned. For instance, suppose an i^(th) morpheme has five connection code pairs, of which three pairs have a front connection code which relates it to the (i-1)^(th) morpheme. In this case, the CPU 16 (FIG. 1) prunes or eliminates two pairs of connection codes of the i^(th) morpheme, as they do not relate the i^(th) morpheme to its predecessor. On the other hand, suppose that the (i+1)^(th) morpheme has three pairs of connection codes, none of which have front connection codes which are conjunctive to any of the back connection codes of the three remaining pairs of the i^(th) morpheme. In such a case, the CPU 16 (FIG. 1) reforms the (i+1)^(th) morpheme in the order of decreasing length.
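The pruning of connection code pairs can be sketched as follows. This is an illustration under stated assumptions, not the patented implementation: the CACT is modeled as a Python dictionary keyed by (back code, front code), the function name `prune_pairs` is hypothetical, and the only table entry taken from the text is the pair (269, 1) indexing connection action code 10.

```python
def prune_pairs(prev_pairs, cand_pairs, cact):
    """Keep the candidate's (front, back) pairs whose front code indexes a
    connection action with some back code of the previous morpheme.

    Returns the surviving pairs and the set of action codes retrieved.
    An empty result means the candidate is not conjunctive.
    """
    prev_backs = {back for _front, back in prev_pairs}
    kept, actions = [], set()
    for front, back in cand_pairs:
        found = {cact[(pb, front)] for pb in prev_backs if (pb, front) in cact}
        if found:
            kept.append((front, back))
            actions |= found
    return kept, actions


# Sketch CACT: only (269, 1) -> 10 is taken from the text.
cact = {(269, 1): 10}
start_of_sentence = [(2, 269)]            # default pair for the sentence start
candidate = [(1, 54), (7, 99)]            # (7, 99) is a hypothetical extra pair

kept, actions = prune_pairs(start_of_sentence, candidate, cact)
print(kept, actions)
```

Here the hypothetical pair (7, 99) is eliminated because no action is indexed by (269, 7), while (1, 54) survives.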

Execution in the CPU 16 (FIG. 1) next proceeds to step 109 where, depending on the success of testing the morpheme, the morpheme will be reformed or the next morpheme will be formed. In the case that no connection action is indexed under any above-described permutation of two codes, formed from a back code of a prior morpheme and a front code of a subsequent morpheme, execution in the CPU 16 (FIG. 1) jumps back to step 102, where a different morpheme is formed from the remaining characters in the sentence. In the reformulation (step 102), the next to longest morpheme may be formed from the remaining characters and tested to determine if it is conjunctive, in the above-described manner, using the CPU 16 (FIG. 1). If, on the other hand, no such conjunctive morpheme may be formed, then backtracking occurs (via steps 103 and 108), i.e., the previously divided morpheme will be redivided (the next to longest morpheme will be formed from the characters which comprise this morpheme and tested) by the CPU 16 (FIG. 1). In such a case, the CPU 16 (FIG. 1) decrements the morpheme counter i by one and redivides the last divided morpheme. Again, if no such morpheme may be formed, then the next previously divided morpheme will be redivided (again, the morpheme counter i being decremented by one), etc. After a new morpheme is formed, execution in the CPU 16 (FIG. 1) otherwise proceeds in a normal fashion as described above.
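The interplay of longest match, conjunctivity testing and backtracking can be sketched with a small recursive function. Recursion is used here in place of the morpheme counter i and the step numbers of FIG. 5, and look-ahead processing is omitted. The lexicon, the connection codes and the name `segment` are hypothetical toy values chosen so that one backtrack occurs; only the default pair (2, 269) and the entry (269, 1) -> 10 come from the text.

```python
def segment(text, lexicon, cact, prev_pairs=((2, 269),)):
    """Divide text into morphemes, longest first, backtracking when a
    candidate is not conjunctive or leads to a dead end.

    Returns a list of morphemes, or None if no division exists.
    """
    if not text:
        return []
    prev_backs = {back for _front, back in prev_pairs}
    for length in range(len(text), 0, -1):        # longest candidate first
        cand = text[:length]
        pairs = lexicon.get(cand)
        if not pairs:
            continue                              # not a listed morpheme
        kept = tuple((f, b) for f, b in pairs
                     if any((pb, f) in cact for pb in prev_backs))
        if not kept:
            continue                              # not conjunctive: reform
        rest = segment(text[length:], lexicon, cact, kept)
        if rest is not None:
            return [cand] + rest
        # Dead end further on: redivide by trying the next shorter candidate.
    return None


# Toy data (hypothetical): "abc" is the longest first match, but nothing
# conjunctive can follow it, so the division falls back to "ab" + "cd".
lexicon = {"abc": [(1, 10)], "ab": [(1, 20)], "cd": [(3, 30)], "d": [(4, 40)]}
cact = {(269, 1): 10, (20, 3): 10}

print(segment("abcd", lexicon, cact))
```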

In the case that the above-described longest morpheme is conjunctive with the previously divided morpheme, execution in the CPU 16 (FIG. 1) continues with step 110. In step 110, the CPU 16 (FIG. 1) records the connection codes of the above-described longest morpheme and the morpheme is considered divided. Additionally, the respective connection action codes which describe the relationship of the longest morpheme to the previous morpheme are also recorded by the CPU 16 (FIG. 1). The variable i, which indicates the number of the next morpheme to be divided, is incremented and the characters which comprise this morpheme are removed from the rest string by the CPU 16 (FIG. 1). Execution in the CPU 16 (FIG. 1) then proceeds with step 112, in which it is determined whether the end of the sentence has been reached. If the end of the sentence has been reached, execution proceeds in the CPU 16 (FIG. 1) to step 118. If not, execution proceeds to step 114.

In step 114, the CPU 16 (FIG. 1) determines if look-ahead processing is necessary. Specifically, in step 114, it is determined if a connection action, which related the current divided morpheme to the previously divided morpheme, was a particular action. Illustratively, in step 114, it is determined whether the corresponding connection action code was 40 and the back connection code of the current divided morpheme is 124, 131, 141-149, 152 or 169. This would indicate that the previously divided morpheme was a word and that the current divided morpheme is a stem. In such a case, the next morpheme must be a leaf having only one character. If the connection action code was 40 and the last divided morpheme has one of the above mentioned back connection codes, execution in the CPU 16 (FIG. 1) illustratively jumps to step 116. Otherwise, execution jumps to step 102 and the CPU 16 (FIG. 1) divides the next morpheme from the remaining characters in the sentence. It may be appreciated that different morphemes may be subsequently formed as the next morpheme depending on which connection codes of the last formed morpheme are used. Preferably, execution proceeds to the look-ahead processing step 114 if any connection action code is 40. Otherwise, execution in the CPU 16 (FIG. 1) proceeds to step 102 to form the longest subsequent morpheme which is conjunctive by means of any back connection code of the previous morpheme.
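The illustrative test of step 114 reduces to a simple predicate, sketched below. The action code 40 and the back connection codes are taken from the text; the names `LOOKAHEAD_ACTION`, `STEM_BACK_CODES` and `needs_lookahead`, and the choice to pass the recorded codes as sets, are assumptions made for illustration.

```python
LOOKAHEAD_ACTION = 40
# Back connection codes 124, 131, 141-149, 152 and 169, as listed in the text.
STEM_BACK_CODES = {124, 131, *range(141, 150), 152, 169}


def needs_lookahead(action_codes, back_codes):
    """Return True if the current morpheme was connected by action 40 and
    carries at least one of the listed back connection codes."""
    return (LOOKAHEAD_ACTION in action_codes
            and any(b in STEM_BACK_CODES for b in back_codes))


# A morpheme connected by action 40 with back code 148 triggers look-ahead.
print(needs_lookahead({40}, {148}))
print(needs_lookahead({30}, {148}))
```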

In step 116, the CPU 16 (FIG. 1) forms the next morpheme from the single next character of the remaining characters in the rest string (sentence). This procedure, referred to as look-ahead processing, exploits a natural constraint of the Japanese language: the morpheme following the sequence of a word followed by a stem must be a leaf (a single character). After forming the next morpheme from the single next character, execution in the CPU 16 (FIG. 1) jumps to step 104 and otherwise proceeds normally. Because in look-ahead processing the next morpheme to be divided has already been formed, steps 102-103 are skipped.

When the end of the sentence is reached, execution proceeds from step 112 to step 118. In step 118, the CPU 16 (FIG. 1) constructs a word graph using the connection actions and the knowledge information associated with each morpheme. To achieve this end, the CPU 16 (FIG. 1) executes procedures associated with each connection action code. FIG. 9 depicts a generalized table of connection action codes, and a brief description of the procedure executed in accordance with the connection action codes. After all of the connection action codes are executed, the graphed sentence is outputted by the CPU 16 (FIG. 1) from the data processing system 14 (FIG. 1).

A sample segmentation of a Japanese sentence will now be described with reference to FIGS. 5 and 10. Depicted in FIG. 10 is a table showing the state of the morphology after the CPU 16 (FIG. 1) executes a particular step of FIG. 5. The first column of the table represents the value of the variable i and describes which morpheme is currently being divided. The second column represents the step of FIG. 5 which was executed by the CPU 16 (FIG. 1) to place the morphology in the state of that particular line. The third column displays the current morpheme that is in the process of being divided. The fourth column shows the connection codes of the morpheme currently being divided. The fifth column displays the connection action code which relates the current morpheme to the previous morpheme, and the final column displays the rest string after each step is executed. For purposes of brevity, the details of the longest match step 102 are not shown.

Initially, in step 100, a sentence of characters is read in and stored in the rest string. The morpheme counter, i, is initialized to one and the beginning of the sentence (represented by the "<" character) is assigned the default pair of connection codes (i.e., front connection code "2" and back connection code "269"). Next, step 102 is executed until the longest possible morpheme that may be extracted from the rest string is retrieved. Since a morpheme is formed (step 103), step 104 is executed and the connection code pair (i.e., front connection code "1" and back connection code "54") is retrieved for the first formed morpheme. Next, step 106 is executed to determine if this morpheme is conjunctive with the beginning of the sentence. Since the pair (269, 1) indexes the connection action code 10, the morpheme is conjunctive with the beginning of the sentence. Thus, in steps 109-114, the connection action, 10, is recorded, the morpheme counter i is incremented and execution proceeds back to step 102, where the next morpheme is divided from the remaining characters in the sentence.

As can be seen from FIG. 10, morphemes two through four each comprise two characters and respectively have the connection code pairs (1, 74), (205, 164), (1, 54). Morphemes two through four are respectively connected by the connection actions corresponding to codes 10, 30 and 44. Similarly, as each morpheme is determined, the characters remaining in the rest string are reduced.

Finally, the eighth morpheme is formed. A determination is made that the eighth morpheme is connected to the seventh morpheme by the connection action corresponding to code 40. Further, the back connection code of the eighth morpheme is 148. Thus, look-ahead processing is enabled in step 116. The ninth morpheme is formed from the single next character of the rest string. Thereafter, execution proceeds immediately to step 104, where the connection codes (142, 137) are retrieved from the CCT for the ninth morpheme. It is then determined, in step 106, that the ninth morpheme is connected to the eighth morpheme by the connection action corresponding to code 30.

Referring now to FIG. 13, an exemplary table is depicted showing the results of the execution by the CPU 16 (FIG. 1) over an entire sentence prior to executing step 118 of FIG. 5. The first column displays the number of each morpheme. The second column displays the segmented morpheme. The third column displays the connection action code relating the morpheme of that particular line of the table to the previous morpheme. Finally, the last column displays the connection code pair of each morpheme.

In step 118 of FIG. 5, procedures, generally described in FIG. 9, are executed, and a knowledge dictionary is consulted, to construct a word graph. Generally, the execution of connection action codes involves deleting morphemes associated with certain action codes. For instance, each morpheme connected by connection action code 30 is deleted. FIG. 12 illustrates the results of the execution of the connection actions on the data of FIG. 11. The first and second columns are as before. The third column displays the role each morpheme plays in the sentence and the fourth column displays useful information which relates the morphemes. The data of table 13 may then be outputted for further processing, such as executing computer functions as instructed by the inputted sentence or translating the written sentence to another language.
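The deletion of morphemes by connection action code can be sketched as a filter over the recorded results. This is only an assumption about the simplest such procedure: the actual procedures of FIG. 9 are not reproduced here, the morpheme labels m1 to m4 are placeholders for the first four Japanese morphemes of the sample sentence, and the names `DELETING_ACTIONS` and `apply_deletions` are hypothetical. The action codes and connection code pairs are those given for morphemes one through four.

```python
# Action codes whose morphemes are deleted; the text names only code 30.
DELETING_ACTIONS = {30}


def apply_deletions(records):
    """Drop every record whose connection action code calls for deletion.

    Each record is (morpheme, action code, (front code, back code)).
    """
    return [r for r in records if r[1] not in DELETING_ACTIONS]


# Placeholder labels stand in for the first four morphemes of the sample.
records = [
    ("m1", 10, (1, 54)),
    ("m2", 10, (1, 74)),
    ("m3", 30, (205, 164)),
    ("m4", 44, (1, 54)),
]

print([r[0] for r in apply_deletions(records)])
```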

Finally, the above-mentioned embodiment is intended merely to illustrate the invention. Numerous other embodiments may be envisioned by those skilled in the art without departing from the spirit of the following claims.

We claim:
 1. A method of morphologizing a sequence of characters into morphemes of a sentence in a data processing system having a CPU and memory comprising the step of: in the CPU, electronically dividing each morpheme of a sequence of morphemes, one at a time from the beginning of the sequence, said dividing step including the steps of, from a subsequence of remaining undivided characters of said sequence of characters beginning with a first character of said sequence of undivided characters, if said morpheme is a first morpheme of said sequence of morphemes, otherwise beginning with a first character immediately following a previously divided morpheme, electronically forming a morpheme which is grammatically conjunctive with the beginning of the sentence, if said morpheme is said first morpheme, otherwise with a previously divided morpheme of said sequence of morphemes, and, if necessary, to form said grammatically conjunctive morpheme, electronically redividing a previously divided morpheme, said forming step comprising the steps of: in the CPU, electronically identifying a longest untested morpheme, from said subsequence of remaining undivided characters, which is also listed in a dictionary stored in memory; in the CPU, retrieving from said memory at least one pair of front and back connection codes associated with said longest untested morpheme from a first table; in the CPU, retrieving one action code, from a second table stored in memory, indexed by each combination of one front connection code of said longest untested morpheme and, if said morpheme is said first morpheme, a default back connection code is supplied, otherwise, one back connection code of said previously divided morpheme retrieved from said memory, and electronically eliminating, from said longest untested morpheme, all pairs of connection codes for which said CPU fails to retrieve any action codes; and if the CPU fails to retrieve any action codes, electronically redividing said longest morpheme in the CPU.
 2. The method of claim 1 further comprising: in the CPU, electronically deleting particular morphemes in response to certain connection action codes.
 3. The method of claim 1 further comprising: if particular action codes are retrieved by the CPU for a particular morpheme, electronically dividing a next morpheme succeeding said particular morpheme by electronically forming said next morpheme with a single character following said particular morpheme.
 4. The method of claim 1 wherein said step of identifying the longest untested morpheme comprises using a dictionary having one tree for storing all of the morphemes beginning with each character, each of said trees having interconnected non-terminal nodes associated with one character and terminal nodes associated with delimiters such that a traversal of a path of said tree, from the root of said tree to any terminal node, spells a morpheme, wherein said identifying step further comprises: in the CPU, electronically selecting a tree whose root node is associated with said first character of said sequence of characters, if said longest untested morpheme is said first morpheme of said sequence of morphemes, otherwise, with said first character immediately following a previously divided morpheme; in the CPU, until a node is reached devoid of children nodes associated with the next character of said subsequence of remaining undivided characters, electronically traversing said selected tree to a child node associated with said next character of said subsequence of remaining undivided characters; and in the CPU, electronically retracing said traversal of said selected tree to the nearest node having a terminal node and electronically forming said longest untested morpheme from the characters associated with each node traversed from said root to said terminal node, in order.
 5. The method of claim 4 wherein said delimiter points to the location in said first table of said front and back connection code pairs associated with said longest untested morpheme.
 6. A method for morphologizing, in a data processing system having a CPU and a memory, the characters of a sentence into morphemes comprising the step of: in the CPU, electronically dividing each morpheme of a sequence of morphemes, one at a time from the beginning of the sentence, by electronically forming a morpheme from a sequence of remaining undivided characters of the sentence beginning with a first character of said sequence of undivided characters, if said morpheme is a first morpheme of said sequence of morphemes, otherwise beginning with a first character immediately following a previously divided morpheme, which is grammatically conjunctive with the beginning of the sentence, if said morpheme is said first morpheme, beginning otherwise with a previously divided morpheme of said sequence of morphemes and, if necessary to form said grammatically conjunctive morpheme, by electronically redividing a previously divided morpheme, said forming step comprising the steps of: (a) in the CPU, electronically identifying a longest untested morpheme of the sentence from said sequence of remaining undivided characters of the sentence which is listed in a dictionary stored in memory; (b) in the CPU, electronically retrieving at least one pair of front and back connection codes of said longest untested morpheme from a first table stored in memory; (c) in the CPU, electronically testing said longest untested morpheme by retrieving one action code, from a second table stored in memory, indexed by each combination of one front connection code of said longest untested morpheme and, if said morpheme is said first morpheme, a default back connection code is supplied, otherwise, one back connection code of said previously divided morpheme retrieved from said memory, and electronically eliminating, from said longest untested morpheme, all pairs of connection codes for which said CPU fails to retrieve any action codes; (d) if said CPU fails to retrieve any action codes from said second table in step (c), returning to step (a); (e) if a particular one of said action codes is retrieved from said second table, electronically forming a next morpheme from a single character of said sequence of remaining undivided characters following said morpheme for which said particular action code was retrieved and returning to step (b); and (f) until the end of the sentence is reached by the CPU, returning to step (a).
 7. The method of claim 6 wherein said method is used for translating Japanese to Chinese and further comprises: in the CPU, electronically replacing said morphemes with Chinese morphemes using a Japanese to Chinese knowledge dictionary stored in memory.
 8. The method of claim 6 wherein said step of identifying the longest untested morpheme comprises using a dictionary having one or more trees for storing all of the morphemes beginning with each character, each of said trees having interconnected non-terminal nodes associated with one character and terminal nodes associated with delimiters such that a traversal of a path of said trees, from the root of said tree to any terminal node, spells a morpheme, wherein said identifying step further comprises: in the CPU, electronically selecting a tree whose root node is associated with said first character of said sequence of characters, if said longest untested morpheme is said first morpheme of said sequence of morphemes, otherwise, with said first character immediately following a previously divided morpheme; in the CPU, until a node is reached devoid of children nodes associated with the next character of said sequence of remaining undivided characters in said sentence, electronically traversing said selected tree to a child node associated with the next character of said sequence of remaining undivided characters in said sentence; and in the CPU, electronically retracing said traversal of said selected tree to the nearest node having a terminal node and electronically forming said longest untested morpheme from the characters associated with each node traversed from said root to said terminal node, in order.
 9. The method of claim 6 wherein said delimiter points to the location in said first table of said front and back connection code pairs associated with said longest untested morpheme.
 10. A text processing system comprising: an optical scanner for generating a scanned sequence of characters; a data processing system connected to said optical scanner for receiving and for morphologizing said scanned sequence of characters, said data processing system comprising: a memory for storing a dictionary of valid morphemes, a first table of connection code pairs containing front and back connection codes associated with each morpheme and a second table of connection action codes corresponding to a back connection code of a preceding morpheme and a front connection code of a subsequent morpheme for every valid adjacent placement of preceding and subsequent morphemes; a CPU, connected to said memory, for receiving each sequence of characters, for dividing a sequence of morphemes of a sentence, one at a time from the beginning of said sequence, by forming a longest morpheme from a sequence of remaining characters of one of said received sequences of characters beginning with a first character if said morpheme is a first morpheme of said sequence of morphemes, otherwise, beginning with a first character immediately following a previously divided morpheme which is both listed in said dictionary and conjunctive with the beginning of the sentence, if said morpheme is said first morpheme, otherwise, beginning with said previously divided morpheme, by redividing a previously divided morpheme if necessary to form said conjunctive morpheme and by testing each formed morpheme by retrieving one or more pairs of front and back connection codes associated with said longest morpheme from said first table stored in memory, retrieving one connection action code, from said second table stored in memory, indexed by each combination of one front connection code of said longest morpheme and, if said morpheme is said first morpheme, a default back connection code is supplied, otherwise, one back connection code of said previously divided morpheme retrieved from said memory, eliminating all connection code pairs, from said longest morpheme, for which said CPU fails to retrieve any action codes, redividing said longest morpheme if said CPU fails to retrieve any action codes and, if particular action codes are retrieved, dividing a next morpheme by forming said next morpheme with a single character following said morpheme for which one from said particular action codes was retrieved of said sequence of remaining characters and testing said morpheme.